library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(visNetwork)
library(plotrix)

Description

This document provides a walkthrough of the analyis and decision-making steps taken in the development of the Spotify app. The desired app will show a network map of countries based on their shared songs, and allow the user to sample the shared songs. The visualization attempts to outline unique musical communities.

Data Source

The data used was provided by Charlie Thompson http://www.rcharlie.com/, who introduced the dataset and provided an exploratory analysis at the Data Viz meetup on December 5th, 2017.

The data of concern is the list of the top songs being listened to in each country, after removing the most common songs listend to worldwide. The data was initially pulled by the lecturer via the Spotify API and shared via Google Drive: https://drive.google.com/drive/folders/1fwSi6und-eY-yfVKRDpg0H6mwn33zrQo

Data PreProcessing

The goal of this section is to convert the data of top songs by country into network graph of connections between countries, with connections defined as the number of shared songs between countries. Using this data, we should be able to easily draw a network graph to depict musical communities.

Additionally, we will want a data frame containing lists of songs shared between countries. This will allow us to sample songs that are shared between a given set of countries.

We begin by loading and inspecting the data to determine which columns we need.

load("raw_data/geo_tracks.rData")
head(geo_tracks)
## # A tibble: 6 x 31
##                      playlist_name
##                              <chr>
## 1 The Sound of 's-hertogenbosch NL
## 2 The Sound of 's-hertogenbosch NL
## 3 The Sound of 's-hertogenbosch NL
## 4 The Sound of 's-hertogenbosch NL
## 5 The Sound of 's-hertogenbosch NL
## 6 The Sound of 's-hertogenbosch NL
## # ... with 30 more variables: playlist_img <chr>, track_name <chr>,
## #   track_uri <chr>, artist_name <chr>, album_name <chr>, album_img <chr>,
## #   track_popularity <int>, track_preview_url <chr>,
## #   track_open_spotify_url <chr>, danceability <dbl>, energy <dbl>,
## #   key <chr>, loudness <dbl>, mode <chr>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>,
## #   valence <dbl>, tempo <dbl>, duration_ms <dbl>, time_signature <dbl>,
## #   key_mode <chr>, city_country <chr>, city <chr>, country_abb <chr>,
## #   iso3c <chr>, region <chr>, continent <chr>, country <chr>

We’ll want the track_uri to get our distinct counts of shared tracks, and the country and city columns for grouping. We’ll also include continent and region to give us an extra dimension to visualize the data with, if desired. This should give us sufficient information to create our network map with. We’ll also want the track name, artist name, album name, album image, and track preview url so we can have a surface details about a song when called.

geo_tracks <- geo_tracks %>% select(city_country, country, region, continent, 
    track_uri, track_name, artist_name, album_name, album_img, track_preview_url)

We need to inspect the data before processing to make sure we’re not surprised in the end by missing data.

colSums(is.na(geo_tracks))
##      city_country           country            region         continent 
##                 0                 0                 0                 0 
##         track_uri        track_name       artist_name        album_name 
##                 0                 0                 0                 0 
##         album_img track_preview_url 
##                 0             61815

Fortunately, most of our data is filled. There is a substantial chunk of missing preview urls, which means that there will be cases when we might not be able to sample shared tracks between countries if all the shared tracks lack a preview.

Because the rest of our data is issue-free, we can build our network data. We start with building a table that has every track mapped to country pairs, and another table with city pairs. This will allow us to sample songs based on which pair of countries or cities we’re interested in. We join the tables with a copy of themselves on the track uri and remove instances where it matched the exact row to itself.

geo_track_temp <- geo_tracks

track_country_colocation <- geo_tracks %>% inner_join(geo_track_temp, by = c(track_uri = "track_uri")) %>% 
    filter(country.x != country.y) %>% select(track_uri, country.x, country.y) %>% 
    unique()

track_city_colocation <- geo_tracks %>% inner_join(geo_track_temp, by = c(track_uri = "track_uri")) %>% 
    filter(city_country.x != city_country.y) %>% select(track_uri, city_country.x, 
    city_country.y) %>% unique()

To build our edge graphs, we summarise the number of distinct track uri’s by the country and city groups. We then remove all the rows that are duplicates of another row but for the country/city names existing in opposite columns.

country_network <- track_country_colocation %>% group_by(country.x, country.y) %>% 
    summarise(Weight = n_distinct(track_uri))

country_network <- country_network[!duplicated(t(apply(country_network, 1, sort))), 
    ]
colnames(country_network) <- c("Source", "Target", "weight")

city_network <- track_city_colocation %>% group_by(city_country.x, city_country.y) %>% 
    summarise(Weight = n_distinct(track_uri))

city_network <- city_network[!duplicated(t(apply(city_network, 1, sort))), ]
colnames(city_network) <- c("Source", "Target", "weight")

We build a node graph that contains each country, with its region and continent, and a node graph that contains each city, with its country, region, and continent

country_nodes <- unique(geo_tracks[, 2:4])
city_nodes <- unique(geo_tracks[, 1:4])

We also want a unique database of the tracks in our colocation tables, so we can access their information.

track_country_data <- unique(geo_tracks[complete.cases(geo_tracks) & geo_tracks$track_uri %in% 
    track_country_colocation$track_uri, 5:10])
track_city_data <- unique(geo_tracks[complete.cases(geo_tracks) & geo_tracks$track_uri %in% 
    track_city_colocation$track_uri, 5:10])

Data Exploration

The goal of this section is to test the different ways of vizualizing the networks to ensure that the output is informative and easy to read.

Ideally, we can use igraph to create a visually appealing network graph, and then pass it through visNetwork to allow for interactivity. We allow DataViz to map in the default, which appears to be a Fruchterman-Reingold layout structure.

Country

The country network graph is really aesthetically pleasing, with some logical clustering around same-language countries, with the Europeans forming sparser and odd relationships.

g <- graph_from_data_frame(d = country_network, vertices = country_nodes, directed = F)

## Edges
E(g)$width <- E(g)$weight/10
E(g)$id <- 1:ecount(g)

## Vertices Color nodes by continent
color1 <- c("red", "blue", "green", "purple", "orange")
V(g)$color <- color1[factor(V(g)$continent)]
color1 <- c("red", "blue", "green", "purple", "orange", "pink", "brown", "black", 
    "lightblue", "lightgreen", "grey", "pink", "yellow", "lightgrey")
V(g)$color <- color1[factor(V(g)$region)]

## Plot
plot(g, vertex.label = NA, layout = layout.fruchterman.reingold(g))

visIgraph(g, idToLabel = TRUE) %>% visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)

City

This graph is a lot harder to parse. The visNetwork actually causes my computer to crash - not great. I think it might be intersting to play around with though, but it simply takes too long to render to be any fun.

h <- graph_from_data_frame(d = city_network, vertices = city_nodes, directed = F)

## Edges
E(h)$width <- E(h)$weight/10
E(h)$id <- 1:ecount(h)

## Vertices Color nodes by continent
color1 <- c("red", "blue", "green", "purple", "orange")
V(h)$color <- color1[factor(V(h)$continent)]
color1 <- c("red", "blue", "green", "purple", "orange", "pink", "brown", "black", 
    "lightblue", "lightgreen", "grey", "pink", "yellow", "lightgrey")
V(h)$color <- color1[factor(V(h)$region)]

## Plot
plot(h, vertex.label = NA, layout = layout.fruchterman.reingold(h))

visIgraph(h, idToLabel = TRUE) %>% visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)

Save Data

We save the data we need to run the shiny app:

saveRDS(g, "data/country_network.rds")
saveRDS(h, "data/city_network.rds")
saveRDS(track_country_colocation, "data/track_country_colocation.rds")
saveRDS(track_country_data, "data/track_country_data.rds")
saveRDS(track_city_colocation, "data/track_city_colocation.rds")
saveRDS(track_city_data, "data/track_city_data.rds")

Citations and Credits

Thompson, Charlie http://www.rcharlie.com/